In [40]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import kagglehub
import os
import plotly.graph_objects as go
import plotly.express as px
from IPython.display import display
import warnings
from plotly.subplots import make_subplots
warnings.filterwarnings('ignore')
print(f"Numpy version: {np.__version__}")
Numpy version: 1.23.4
In [41]:
# Download latest version
path = kagglehub.dataset_download("muhammadroshaanriaz/students-performance-dataset-cleaned")

print("Path to dataset files:", path)
os.listdir(path)
data = pd.read_csv(path + "/Cleaned_Students_Performance.csv")
Path to dataset files: /home/gapostolides/.cache/kagglehub/datasets/muhammadroshaanriaz/students-performance-dataset-cleaned/versions/1
In [42]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   gender                       1000 non-null   int64  
 1   race_ethnicity               1000 non-null   object 
 2   parental_level_of_education  1000 non-null   object 
 3   lunch                        1000 non-null   int64  
 4   test_preparation_course      1000 non-null   int64  
 5   math_score                   1000 non-null   int64  
 6   reading_score                1000 non-null   int64  
 7   writing_score                1000 non-null   int64  
 8   total_score                  1000 non-null   int64  
 9   average_score                1000 non-null   float64
dtypes: float64(1), int64(7), object(2)
memory usage: 78.2+ KB
In [43]:
data.head(5)
Out[43]:
gender race_ethnicity parental_level_of_education lunch test_preparation_course math_score reading_score writing_score total_score average_score
0 0 group B bachelor's degree 1 0 72 72 74 218 72.666667
1 0 group C some college 1 1 69 90 88 247 82.333333
2 0 group B master's degree 1 0 90 95 93 278 92.666667
3 1 group A associate's degree 0 0 47 57 44 148 49.333333
4 1 group C some college 1 0 76 78 75 229 76.333333

Check for missing values¶

In [44]:
nans_count = data.isnull().sum()
print(nans_count)
gender                         0
race_ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
total_score                    0
average_score                  0
dtype: int64

No need for data imputation as there are no missing values.

Features¶

  • Features (x)

    • Gender: Useful for analyzing performance differences between male and female students.

    • Race/Ethnicity: Allows analysis of academic performance trends across different racial or ethnic groups.

    • Parental Level of Education: Indicates the educational background of the student's family.

    • Lunch: Shows whether students receive a free or reduced lunch, which is often a socioeconomic indicator.

    • Test Preparation Course: This tells whether students completed a test prep course, which could impact their performance.

  • Variables of Interest (y)

    • Math Score: Provides a measure of each student’s performance in math, used to calculate averages or trends across various demographics.

    • Reading Score: Measures performance in reading, allowing for insights into literacy and comprehension levels among students.

    • Writing Score: Evaluates students' writing skills, which can be analyzed to assess overall literacy and expression.

    • Total Score: Cumulative Score at Reading, Maths and Writing

Exploratory Data Analysis (EDA)¶

Initial view at the data

In [45]:
# Show the data metrics
data.describe(include='all')
Out[45]:
gender race_ethnicity parental_level_of_education lunch test_preparation_course math_score reading_score writing_score total_score average_score
count 1000.000000 1000 1000 1000.000000 1000.000000 1000.00000 1000.000000 1000.000000 1000.000000 1000.000000
unique NaN 5 6 NaN NaN NaN NaN NaN NaN NaN
top NaN group C some college NaN NaN NaN NaN NaN NaN NaN
freq NaN 319 226 NaN NaN NaN NaN NaN NaN NaN
mean 0.482000 NaN NaN 0.645000 0.358000 66.08900 69.169000 68.054000 203.312000 67.770667
std 0.499926 NaN NaN 0.478753 0.479652 15.16308 14.600192 15.195657 42.771978 14.257326
min 0.000000 NaN NaN 0.000000 0.000000 0.00000 17.000000 10.000000 27.000000 9.000000
25% 0.000000 NaN NaN 0.000000 0.000000 57.00000 59.000000 57.750000 175.000000 58.333333
50% 0.000000 NaN NaN 1.000000 0.000000 66.00000 70.000000 69.000000 205.000000 68.333333
75% 1.000000 NaN NaN 1.000000 1.000000 77.00000 79.000000 79.000000 233.000000 77.666667
max 1.000000 NaN NaN 1.000000 1.000000 100.00000 100.000000 100.000000 300.000000 100.000000

Distributions & Center Metrics of the datset¶

In [46]:
#Computing the means, medians and modes of the data
# Numerical data
print(f"Numerical data Metrics:")
display(data.drop(columns=["gender","lunch","test_preparation_course"]).describe())

print(f"Categorical data Metrics:")
display(data.drop(columns=["math_score","reading_score","writing_score","total_score","average_score"]).astype("category").describe())
Numerical data Metrics:
math_score reading_score writing_score total_score average_score
count 1000.00000 1000.000000 1000.000000 1000.000000 1000.000000
mean 66.08900 69.169000 68.054000 203.312000 67.770667
std 15.16308 14.600192 15.195657 42.771978 14.257326
min 0.00000 17.000000 10.000000 27.000000 9.000000
25% 57.00000 59.000000 57.750000 175.000000 58.333333
50% 66.00000 70.000000 69.000000 205.000000 68.333333
75% 77.00000 79.000000 79.000000 233.000000 77.666667
max 100.00000 100.000000 100.000000 300.000000 100.000000
Categorical data Metrics:
gender race_ethnicity parental_level_of_education lunch test_preparation_course
count 1000 1000 1000 1000 1000
unique 2 5 6 2 2
top 0 group C some college 1 0
freq 518 319 226 645 642
In [47]:
# Pair Plot for Numerical data
plt.figure(figsize=(20, 10))
sns.pairplot(data=data,vars=["math_score","reading_score","writing_score","average_score"],diag_kind="auto")
plt.show()
<Figure size 2000x1000 with 0 Axes>
No description has been provided for this image
  • All distributions for the 4 scores roughly follow a normal distribution.
  • Strong linear relationship between average score and the other scores, which is to be expected as it is an infered metric from the rest.
  • Strong linear relationship shown in the scatterplots between the scores (especially between reading and writing scores suggesting strong corelation).
In [49]:
import plotly
plotly.offline.init_notebook_mode())
gender_counts = data["gender"].map({0: "male", 1: "female"}).value_counts()
gendet_counts_labels = gender_counts.index


lunch_counts = data["lunch"].map({0: "standard", 1: "free/reduced"}).value_counts()
lunch_labels = lunch_counts.index

fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig.add_trace(go.Pie(labels=gendet_counts_labels, values=gender_counts, name="Gender",textfont=dict(size=25),marker_colors=['#000042', '#A4303F']),
              1, 1)
fig.add_trace(go.Pie(labels=lunch_labels, values=lunch_counts, name="Lunch",textfont=dict(size=25),marker_colors=['#F5A300','#177E89']),
              1, 2)


# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.4, hoverinfo="label+percent+name")

fig.update_layout(
    # Add annotations in the center of the donut pies.
    annotations=[dict(text='Gender', x=sum(fig.get_subplot(1, 1).x) / 2, y=0.5,
                      font_size=30, showarrow=False, xanchor="center"),
                 dict(text='Lunch', x=sum(fig.get_subplot(1, 2).x) / 2, y=0.5,
                      font_size=30, showarrow=False, xanchor="center")],
    
    legend=dict(font=dict(size=25),title=dict(text="Populations")),
    title=dict(text="Categorial Data Populations",font=dict(size=30)))

fig.show()
  File <tokenize>:22
    annotations=[dict(text='Gender', x=sum(fig.get_subplot(1, 1).x) / 2, y=0.5,
    ^
IndentationError: unindent does not match any outer indentation level
In [ ]:
edu_counts = data["parental_level_of_education"].value_counts()
edu_labels = data["parental_level_of_education"].value_counts().index

ethnicity_scores = data["race_ethnicity"].value_counts()
ethnicity_labels = ethnicity_scores.index


color_discrete_sequence_2=["#C3F73A","#306B34","#EF065B","#64A7CE","#5716A2",]
color_discrete_sequence_1=[
                 px.colors.qualitative.Dark2[0],
                 px.colors.qualitative.Dark24[7],
               px.colors.qualitative.Dark2[2],
                 px.colors.qualitative.Dark2[6],
               px.colors.qualitative.Safe[1],
               px.colors.qualitative.Alphabet[13],]
            
fig2 = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
fig2.add_trace(go.Pie(labels=edu_labels, values=edu_counts, name="Parent Edu",textfont=dict(size=25),marker_colors=color_discrete_sequence_1),
              1, 1)
fig2.add_trace(go.Pie(labels=ethnicity_labels, values=ethnicity_scores, name="Ethnicity",textfont=dict(size=25),marker_colors=color_discrete_sequence_2),
              1, 2)
fig2.update_traces(hole=.6, hoverinfo="label+percent+name")
fig2.update_layout(
    # Add annotations in the center of the donut pies.
    annotations=[dict(text=f'Parental<br>Education', x=sum(fig.get_subplot(1, 1).x) / 2, y=0.5,
                      font_size=30, showarrow=False, xanchor="center"),
                 dict(text='Ethnicity', x=sum(fig.get_subplot(1, 2).x) / 2, y=0.5,
                      font_size=30, showarrow=False, xanchor="center")],
    
    legend=dict(font=dict(size=12),title=dict(text="Populations")),
    title=dict(text="",font=dict(size=30)))
fig2.show()

Box Plots¶

In [11]:
test_list = ["math_score","reading_score","writing_score"]
# Creating an average score column
data["average_score"] = data[test_list].mean(axis=1)
test_list.append("average_score")
In [12]:
# Convert the data into long format
genderwise_scores = pd.melt(data, id_vars=["gender"], 
                    value_vars=test_list,
                    var_name="Subject", value_name="Score")

# Replace gender values with labels
genderwise_scores["gender"] = genderwise_scores["gender"].map({1: "Male", 0: "Female"})

# Create the box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x="Subject", y="Score", hue="gender", data=genderwise_scores, palette=["blue", "pink"])

# Adding titles and labels
plt.title("Scores by Gender in Different Subjects")
plt.xlabel("Subject")
plt.ylabel("Score")

# Display the plot
plt.show()
No description has been provided for this image
In [13]:
lunchwise_scores = pd.melt(data, id_vars=["lunch"], 
                            value_vars=test_list,
                            var_name="Subject", value_name="Score")

lunchwise_scores["lunch"] = lunchwise_scores["lunch"].map({1: "Standard", 0: "Free/Reduced"})
plt.figure(figsize=(10, 6))
sns.boxplot(x="Subject", y="Score", hue="lunch", data=lunchwise_scores, palette=["green", "red"])
plt.title("Scores by Lunch Type in Different Subjects")
plt.xlabel("Subject")
plt.ylabel("Score")
plt.show()
No description has been provided for this image
In [14]:
testprepwise_scores = pd.melt(data, id_vars=["test_preparation_course"],
                              value_vars=test_list,var_name="Subject",value_name="Score")

testprepwise_scores["test_preparation_course"] = testprepwise_scores["test_preparation_course"].map({1:"Taken",0:"Not Taken"})
plt.figure(figsize=(10,6))
sns.boxplot(x="Subject",y="Score",hue="test_preparation_course",data=testprepwise_scores,palette=["orange","purple"])
plt.title("Scores by Test Preparation Course in Different Subjects")
plt.xlabel("Subject")
plt.ylabel("Score")
plt.show()
No description has been provided for this image
In [15]:
# Get sorted unique labels from the race_ethnicity column
race_labels = sorted(data["race_ethnicity"].unique().tolist())

racewise_scores = pd.melt(data, id_vars=["race_ethnicity"], 
                          value_vars=test_list, 
                          var_name="Subject", value_name="Score")

# Create the box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x="Subject", y="Score", hue="race_ethnicity", data=racewise_scores, hue_order=race_labels)
plt.title("Race-wise Scores in Different Subjects")
plt.xlabel("Subject")
plt.ylabel("Score")
plt.legend(title="Ethnicity",ncols=5)
plt.show()
No description has been provided for this image
In [16]:
parent_edu_labels = data["parental_level_of_education"].unique().tolist()
parent_edu_scores = pd.melt(data,id_vars=["parental_level_of_education"],value_vars=test_list,var_name="Subject",value_name="Score")

plt.figure(figsize=(10,6))
sns.boxplot(x="Subject",y="Score",hue="parental_level_of_education",data=parent_edu_scores,hue_order=parent_edu_labels)
plt.title("Parental Education-wise Scores in Different Subjects")
plt.xlabel("Subject")
plt.ylabel("Score")
plt.legend(title="Parental Education",loc="lower left",ncol=3)
# plt.ylim(-30,120)
plt.show()
No description has been provided for this image

Correlation Analysis¶

  • Examining correlation between average score and the different features
In [17]:
# Create a correlation matrix
# Does not make sense to calculate correlation between categorical variables
corelation_data = data.drop(columns=["reading_score","writing_score","math_score","total_score"])
correlation = corelation_data.corr()
plt.figure(figsize=(10,6))
sns.heatmap(correlation,annot=True,cmap="viridis")
plt.title("Correlation Matrix")
plt.show()
No description has been provided for this image
  • Indicates high correlation between the type of lunch and the average score.
  • Indicates high correlation between test preparation score and the average score.
  • Two most significant features are: lunch type and test_preparation_course.
In [18]:
# Correlation between scores
score_correlation = data[test_list].drop(columns="average_score").corr()
plt.figure(figsize=(10,6))
sns.heatmap(score_correlation,annot=True,cmap="viridis")
plt.title("Correlation between Scores")
plt.show()
No description has been provided for this image
  • Scores are strongly correlated with each other meaning if one student has a good score at reading then its likely to have also a good score at writing.
  • Comparatevely there is less corelation between math score and reading and writing scores.
In [19]:
#Checking the correlation between features and individual scores
# Create a correlation matrix
test_list = ["math_score", "reading_score", "writing_score"]
def individual_score_correlation(data, checked_test):
    correlation_data = data.drop(columns=checked_test).corr()
    plt.figure(figsize=(10, 6))
    sns.heatmap(correlation_data, annot=True, cmap="viridis")
    plt.title("Correlation Matrix")


for subject in test_list:
    tests = ["math_score", "reading_score", "writing_score", "average_score","total_score"]
    tests.remove(subject)
    individual_score_correlation(data, tests)
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
  • There is a strong correlation between lunch and scores in general.
  • Strongest correlation is shown between lunch and math scores.

T-test¶

  • Employeed for binary features only. For multiclass features we will use ANOVA.
  • Employeeing t-test to quantify difference between gender in average score.
In [20]:
from scipy.stats import ttest_ind

print("*"*65)
print(" "*23+"T-Tests for Gender")
print("*"*65,end="\n")

for subject in test_list:
    male_scores = genderwise_scores[genderwise_scores["gender"]=="Male"]
    male_math_scores = male_scores[male_scores["Subject"]==subject]["Score"]

    female_scores = genderwise_scores[genderwise_scores["gender"]=="Female"]
    female_math_scores = female_scores[female_scores["Subject"]==subject]["Score"]

    # Perform t-test
    t_statistics, p_values = ttest_ind(male_math_scores,female_math_scores)
    print(f"Significance Difference between Male and Female for {subject}")
    print("="*65)
    print("T statistics:", t_statistics)
    print("P values:", p_values)

    #Using a significance level of 0.05 ==> Confidence level of 95%
    if p_values < 0.05:
        print("Null hypothesis rejected. There is a significant difference\n")
        
print("\n"+"*"*75)
print(" "*25+"T-Tests for Lunch Type")
print("*"*75, end="\n"*2)

for subject in test_list:
    standard_scores = lunchwise_scores[lunchwise_scores["lunch"]=="Standard"]
    standard_math_scores = standard_scores[standard_scores["Subject"]==subject]["Score"]

    free_scores = lunchwise_scores[lunchwise_scores["lunch"]=="Free/Reduced"]
    free_math_scores = free_scores[free_scores["Subject"]==subject]["Score"]

    # Perform t-test
    t_statistics, p_values = ttest_ind(standard_math_scores,free_math_scores)
    print(f"Significance Difference between Standard and Free/Reduced for {subject}")
    print("="*75)
    print("T statistics:", t_statistics)
    print("P values:", p_values)

    #Using a significance level of 0.05 ==> Confidence level of 95%
    if p_values < 0.05:
        print("Null hypothesis rejected. There is a significant difference\n")
*****************************************************************
                       T-Tests for Gender
*****************************************************************
Significance Difference between Male and Female for math_score
=================================================================
T statistics: 5.383245869828983
P values: 9.120185549328822e-08
Null hypothesis rejected. There is a significant difference

Significance Difference between Male and Female for reading_score
=================================================================
T statistics: -7.959308005187657
P values: 4.680538743933289e-15
Null hypothesis rejected. There is a significant difference

Significance Difference between Male and Female for writing_score
=================================================================
T statistics: -9.979557910004507
P values: 2.019877706867934e-22
Null hypothesis rejected. There is a significant difference


***************************************************************************
                         T-Tests for Lunch Type
***************************************************************************

Significance Difference between Standard and Free/Reduced for math_score
===========================================================================
T statistics: 11.837180472914612
P values: 2.4131955993137074e-30
Null hypothesis rejected. There is a significant difference

Significance Difference between Standard and Free/Reduced for reading_score
===========================================================================
T statistics: 7.451056467473455
P values: 2.0027966545279011e-13
Null hypothesis rejected. There is a significant difference

Significance Difference between Standard and Free/Reduced for writing_score
===========================================================================
T statistics: 8.009784197834758
P values: 3.1861895831664765e-15
Null hypothesis rejected. There is a significant difference

ANOVA Test¶

  • Performing an anova test for significant difference for:
    • Different parental education level
    • Different ethnicities
In [21]:
from scipy.stats import f_oneway

# Performing one-way ANOVA for test parental education level
print("*"*75)
print(" "*15+"One-Way ANOVA for Parental Education Level")
print("*"*75, end="\n"*2)

for subject in test_list:
    edu_scores = parent_edu_scores[parent_edu_scores["Subject"]==subject]
    edu_scores = [edu_scores[edu_scores["parental_level_of_education"]==edu]["Score"] for edu in parent_edu_labels]

    # Perform one-way ANOVA
    f_statistics, p_values = f_oneway(*edu_scores)
    print(f"Significance Difference between Parental Education Levels for {subject}")
    print("="*75)
    print("F statistics:", f_statistics)
    print("P values:", p_values)

    #Using a significance level of 0.05 ==> Confidence level of 95%
    if p_values < 0.05:
        print("Null hypothesis rejected. There is a significant difference\n")
        
***************************************************************************
               One-Way ANOVA for Parental Education Level
***************************************************************************

Significance Difference between Parental Education Levels for math_score
===========================================================================
F statistics: 6.521582600453217
P values: 5.592272384107223e-06
Null hypothesis rejected. There is a significant difference

Significance Difference between Parental Education Levels for reading_score
===========================================================================
F statistics: 9.289400382379963
P values: 1.16824570457051e-08
Null hypothesis rejected. There is a significant difference

Significance Difference between Parental Education Levels for writing_score
===========================================================================
F statistics: 14.442416127574992
P values: 1.1202799969771148e-13
Null hypothesis rejected. There is a significant difference

In [22]:
# Performing one-way ANOVA for different ethnicities
print("*"*75)
print(" "*15+"One-Way ANOVA for Ethnicities")
print("*"*75, end="\n"*2)

for subject in test_list:
    ethnicity_scores = racewise_scores[racewise_scores["Subject"]==subject]
    
    ethnicity_scores = [ethnicity_scores[ethnicity_scores["race_ethnicity"]==race]["Score"] for race in race_labels]
    
    # Perform one-way ANOVA
    f_statistics, p_values = f_oneway(*ethnicity_scores)
    print(f"Significance Difference between Ethnicities for {subject}")
    print("="*75)
    print("F statistics:", f_statistics)
    print("P values:", p_values)
    
    #Using a significance level of 0.05 ==> Confidence level of 95%
    if p_values < 0.05:
        print("Null hypothesis rejected. There is a significant difference\n")  
***************************************************************************
               One-Way ANOVA for Ethnicities
***************************************************************************

Significance Difference between Ethnicities for math_score
===========================================================================
F statistics: 14.593885166332635
P values: 1.3732194030370688e-11
Null hypothesis rejected. There is a significant difference

Significance Difference between Ethnicities for reading_score
===========================================================================
F statistics: 5.621659307419643
P values: 0.0001780089103235947
Null hypothesis rejected. There is a significant difference

Significance Difference between Ethnicities for writing_score
===========================================================================
F statistics: 7.162415174347504
P values: 1.0979189070067382e-05
Null hypothesis rejected. There is a significant difference

Conclusions from EDA¶

  • Scores are normally distributed.
  • Scores are strongly correlated especially reading and writing scores.
  • Less corelation between math score and the other two scores.
  • Lunch has a strong corellation with math score.
  • Lunch also has strong correlation with other scores.
  • Test Preparation also shows a strong correlation to the scores.
  • T-test shows that gender and lunch type may affect both the individual scores but also the average score.
  • ANOVA shows that parental education and ethnicity may affect both the individual scores but also the average score.

¶

Predictive Modelling¶

  • At this point we need to determine what we will use as input to the model we are going to train.

  • The demographic and background features will be definetly used as features for our model.

  • Defined Input Features: Select demographic and background features (lunch, test prep, gender, ethnicity, parental education) as inputs for modeling.

  • Defined Target Variables: At least 2 or more of the following scores (math, reding, writing).

In [23]:
x_features = data.drop(columns=["math_score","reading_score","writing_score","total_score","average_score"])
y_target = data[["math_score","reading_score","writing_score"]]

print(f"="*35+ " Features " +"="*35)
x_features.astype("category").describe()
=================================== Features ===================================
Out[23]:
gender race_ethnicity parental_level_of_education lunch test_preparation_course
count 1000 1000 1000 1000 1000
unique 2 5 6 2 2
top 0 group C some college 1 0
freq 518 319 226 645 642
In [24]:
print(f"="*13+ " Potential Targets " +"="*13)
y_target.describe()
============= Potential Targets =============
Out[24]:
math_score reading_score writing_score
count 1000.00000 1000.000000 1000.000000
mean 66.08900 69.169000 68.054000
std 15.16308 14.600192 15.195657
min 0.00000 17.000000 10.000000
25% 57.00000 59.000000 57.750000
50% 66.00000 70.000000 69.000000
75% 77.00000 79.000000 79.000000
max 100.00000 100.000000 100.000000

Outlier Treatment¶

  • Identify Outliers: Focus on detecting outliers in the score data using an IQR-based approach since scores are normally distributed.
  • Visualize Outliers: Plot 3D scatter plots to identify outliers and assess how scores differ.
  • Apply IQR Capping: Use IQR to cap extreme values, reducing the impact of outliers on model performance.
In [25]:
import plotly
# Configure Plotly to be rendered inline in the notebook.
plotly.offline.init_notebook_mode()

# Configure the trace.
trace = go.Scatter3d(
    x= data["math_score"],
    y=data["reading_score"],
    z=data["writing_score"],
    mode='markers',
    marker={
        'size': 10,
        'opacity': 0.8,
        'color': 'blue',

    }
)
trace.name = 'Initial Data'

# Configure the layout.
layout = go.Layout(
    margin={'l': 50, 'r': 0, 'b': 0, 't':30},
    scene=dict(xaxis=dict(title=dict(text='Math Score'),color='red'),
               yaxis=dict(title=dict(text='Reading Score'),color='red'),
               zaxis=dict(title=dict(text='Writing Score'),color='red')
              ),
    showlegend=True
)

data2 = [trace]

plot_figure = go.Figure(data=data2, layout=layout)
plot_figure.update_layout(title_font_color='red',title_font_size=20)
plot_figure.update_layout(title="3D Scatter Plot of Scores",paper_bgcolor='rgba(0,0,0,0)',plot_bgcolor='rgba(0,0,0,0)',font=dict(color='red'))
plot_figure.update_layout(font_family="Courier New",font_size=13)
plot_figure.update_layout(legend=dict(font=dict(size=20),title=dict(text="Data Type",font=dict(size=20))))
# Render the plot.
plotly.offline.iplot(plot_figure)
In [26]:
Q1 = y_target.quantile(0.25)
Q3 = y_target.quantile(0.75)
IQR = Q3 - Q1
threshold = 2.0
In [27]:
data_outliers = y_target[
    ((y_target["math_score"] < (Q1["math_score"] - threshold * IQR["math_score"])) |
    (y_target["math_score"] > (Q3["math_score"] + threshold * IQR["math_score"])) |
    (y_target["reading_score"] < (Q1["reading_score"] -threshold  * IQR["reading_score"])) |
    (y_target["reading_score"] > (Q3["reading_score"] +threshold  * IQR["reading_score"])) |
    (y_target["writing_score"] < (Q1["writing_score"] -threshold  * IQR["writing_score"])) |
    (y_target["writing_score"] > (Q3["writing_score"] +threshold  * IQR["writing_score"])))
]

print("Detected Outliers:")
data_outliers
Detected Outliers:
Out[27]:
math_score reading_score writing_score
59 0 17 10
596 30 24 15
980 8 24 23
In [28]:
trace2 = go.Scatter3d(
    x= data_outliers["math_score"],
    y=data_outliers["reading_score"],
    z=data_outliers["writing_score"],
    mode='markers',
    marker={
        'size': 10,
        'opacity': 0.9,
        'color': "red"
    }
)
trace2.name = 'Detected Outliers'
data2 = [trace,trace2]

plot_figure = go.Figure(data=data2, layout=layout)
# Render the plot.
plot_figure.update_layout(title_font_color='red',title_font_size=20)
plot_figure.update_layout(title="3D Scatter Plot of Scores",paper_bgcolor='rgba(0,0,0,0)',plot_bgcolor='rgba(0,0,0,0)',autosize=True,font=dict(color='red'))
plot_figure.update_layout(font_family="Courier New",font_size=13)
plot_figure.update_layout(legend=dict(font=dict(size=20),title=dict(text="Data Type",font=dict(size=20))))
# Render the plot.
plotly.offline.iplot(plot_figure)
In [29]:
lower_bound = Q1 - threshold * IQR
upper_bound = Q3 + threshold * IQR
filtered_data = y_target.copy()
for subject in test_list:
    filtered_data[filtered_data[subject]<lower_bound[subject]] = lower_bound[subject]
    filtered_data[filtered_data[subject]>upper_bound[subject]] = upper_bound[subject]
In [30]:
trace3 = go.Scatter3d(
    x= filtered_data["math_score"],
    y=filtered_data["reading_score"],
    z=filtered_data["writing_score"],
    mode='markers',
    marker={
        'size': 10,
        'opacity': .5,
        'color': "red"
    }
)
trace3.name = 'Data with Outlier Treatment'
trace.marker.opacity = 0.5
data3 = [trace,trace3]

plot_figure = go.Figure(data=data3, layout=layout)
# Render the plot.
plot_figure.update_layout(title_font_color='red',title_font_size=20)
plot_figure.update_layout(title="3D Scatter Plot of Scores",paper_bgcolor='rgba(0,0,0,0)',plot_bgcolor='rgba(0,0,0,0)',autosize=True,font=dict(color='red'))
plot_figure.update_layout(font_family="Courier New",font_size=12)
plot_figure.update_layout(legend=dict(font=dict(size=20),title=dict(text="Data Type",font=dict(size=20))))
fig.update_layout(title=dict(font=dict(size=40), yref='paper'))
# Render the plot.
plotly.offline.iplot(plot_figure)
In [31]:
# Create histogram for `filtered_data` (data without outliers)
hist_filtered = go.Histogram(
    x=filtered_data["math_score"],
    nbinsx=20,
    opacity=0.5,
    name='Math Score (Filtered)',
    marker=dict(color='blue'),
    histnorm="probability"
)

# Create histogram for `data` (data with outliers)
hist_outliers = go.Histogram(
    x=data["math_score"],
    nbinsx=20,
    opacity=0.5,
    name='Math Score with Outliers',
    marker=dict(color='red'),
    histnorm="probability"
    
)

# Combine both histograms into a single figure
fig = go.Figure(data=[hist_filtered, hist_outliers])

# Set title and axis labels
fig.update_layout(
    title="Math Score Distribution with and without Outliers",
    xaxis_title="Math Score",
    yaxis_title="Probability",
    barmode='overlay'  # Overlay both histograms
)
fig.update_layout(title=dict(font=dict(size=30), yref='paper'))
# Show the plot
fig.show()
In [32]:
# Create histogram for `filtered_data` (data without outliers)
hist_filtered = go.Histogram(
    x=filtered_data["reading_score"],
    nbinsx=20,
    opacity=0.5,
    name='Reading Score (Filtered)',
    marker=dict(color='blue'),
    histnorm="probability"
)

# Create histogram for `data` (data with outliers)
hist_outliers = go.Histogram(
    x=data["reading_score"],
    nbinsx=20,
    opacity=0.5,
    name='Reading Score with Outliers',
    marker=dict(color='red'),
    histnorm="probability"
    
)

# Combine both histograms into a single figure
fig = go.Figure(data=[hist_filtered, hist_outliers])

# Set title and axis labels
fig.update_layout(
    title="Reading Score Distribution",
    xaxis_title="Reading Score",
    yaxis_title="Probability",
    barmode='overlay'  # Overlay both histograms
)
fig.update_layout(font_size=14)
fig.update_layout(title=dict(text="Reading Score Distributions", font=dict(size=30), yref='paper'))
# Show the plot
fig.show()
In [33]:
# Create histogram for `filtered_data` (data without outliers)
hist_filtered = go.Histogram(
    x=filtered_data["writing_score"],
    nbinsx=20,
    opacity=0.5,
    name='Writing Score (Filtered)',
    marker=dict(color='blue'),
    histnorm="probability"
)

# Create histogram for `data` (data with outliers)
hist_outliers = go.Histogram(
    x=data["writing_score"],
    nbinsx=20,
    opacity=0.5,
    name='Writing Score with Outliers',
    marker=dict(color='red'),
    histnorm="probability"
    
)

# Combine both histograms into a single figure
fig = go.Figure(data=[hist_filtered, hist_outliers])

# Set title and axis labels
fig.update_layout(
    title="Writing Score Distribution",
    xaxis_title="Writing Score",
    yaxis_title="Probability",
    barmode='overlay'  # Overlay both histograms
)

fig.update_layout(font_size=14)
fig.update_layout(title=dict(font=dict(size=30), yref='paper'))
# Show the plot
fig.show()
  • No significant change is observed on the distributions of reading and writing after performing an outlier treatment.
  • Outlier treatment causes most change to the distribution of math score.

Test Scores Prediction Strategy¶

  • Features to be used: All of the columns for demographics and background will be included in order to predict the desired test scores.

  • Optional Prior Test Inclusion: Incorporate a known test score (math, reading, or writing) to improve the accuracy of predicting the other scores. This approach leverages prior knowledge for enhanced prediction.

In [34]:
# Select if you have a prior test result (math, reading, writing, none).
# If you have a prior test result then the rest will be predicted based on that.
# If you do not have a prior test result then you will be asked to take a test.

already_taken_test = "reading" # "math", "reading", "writing", "none"
In [35]:
X = x_features.copy()
tests_to_predict = ["math_score","reading_score","writing_score"]


if already_taken_test.lower() == "reading":
    X["reading_score"] = y_target["reading_score"]
    tests_to_predict.remove("reading_score")
    print("Reading score has been taken. Predicting Math and Writing scores")
elif already_taken_test.lower() == "writing":
    X["writing_score"] = y_target["writing_score"]
    tests_to_predict.remove("writing_score")
    print("Writing score has been taken. Predicting Math and Reading scores")
elif already_taken_test.lower() == "math":
    X["math_score"] = y_target["math_score"]
    tests_to_predict.remove("math_score")
    print("Math score has been taken. Predicting Reading and Writing scores")
elif already_taken_test.lower() == "none":
    print("No test has been taken yet. Predicting all scores")
else:
    print("Invalid test type. Please enter math, reading, writing or None")
    
Reading score has been taken. Predicting Math and Writing scores

Model Training and Tuning¶

  • Model Selection: Use Random Forest Regressor for flexibility and performance with categorical data.

  • Hyperparameter Tuning: Apply GridSearchCV to fine-tune the model’s hyperparameters (e.g., n_estimators, max_depth) for optimal performance.

  • Train-Test Split: Split data into training and testing sets to evaluate model generalizability.

  • Evaluate Model Performance: Calculate RMSE and R² for test and training sets to assess model accuracy and overfitting risk.

  • Cross-Validation: Perform cross-validation to assess the stability of the model and ensure consistent performance across different subsets of the data.

  • Evaluate Feature Importance: Assess which features are most influential in predicting test scores using the trained Random Forest model.

  • Identify Key Features: Display the top features for each test score prediction to understand what factors contribute most to the model’s predictions.

In [36]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score

predicted_values = []
labels_values = []
best_model_list = []

# Hyperparameter grid for GridSearchCV
param_grid = {
    'n_estimators': [200, 400, 600, 800, 1000],
    'max_depth': [5, 7, 9],
}

for subject in tests_to_predict:
    # Define features and target
    y = filtered_data[subject]  # Target column

    # One-hot encode categorical features
    one_hot_encoded_data = pd.get_dummies(X)

    # Perform train-test split
    X_train, X_test, y_train, y_test = train_test_split(one_hot_encoded_data, y, test_size=0.2, random_state=100)

    #=========================================================================
    # Hyperparameter tuning and model training
    #=========================================================================

    # Initialize the model
    model = RandomForestRegressor(random_state=100)

    # GridSearchCV for hyperparameter tuning
    grid_search = GridSearchCV(model, param_grid, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')
    grid_search.fit(X_train, y_train)

    # Best model from grid search
    best_model = grid_search.best_estimator_

    # Train the best model
    best_model.fit(X_train, y_train)
    y_pred = best_model.predict(X_test)
    y_train_pred = best_model.predict(X_train)
    
    
    # Append the best model, the predictions and labels to a list
    predicted_values.append(y_train_pred)
    labels_values.append(y_train)
    best_model_list.append(best_model)
    
    #=========================================================================
    # Evaluate the model
    #=========================================================================

    # Calculate Mean Squared Error (RMSE)
    test_rmse_error = (mean_squared_error(y_test, y_pred))**0.5
    train_rmse_error = (mean_squared_error(y_train, y_train_pred))**0.5

    # Calculate R2 Score
    r2_score_value = r2_score(y_test, y_pred)
    r2_score_train = r2_score(y_train, y_train_pred)

    # Print results
    print("\n"+"="*70)
    print(f"Subject: {subject}")
    print("="*70)
    print(f"Test Root Mean Squared Error: \t{test_rmse_error}")
    print(f"Train Root Mean Squared Error: \t{train_rmse_error}")
    print(f"Test R2 Score: \t{r2_score_value}")
    print(f"Train R2 Score: \t{r2_score_train}\n")

    # Feature importance analysis
    feature_importance = best_model.feature_importances_
    important_features = pd.DataFrame({
    'Feature': one_hot_encoded_data.columns,
    'Importance': feature_importance
    }).sort_values(by='Importance', ascending=False)

    print("Top 10 Features Based on Importance:")
    print(important_features.head(10))
    print("\n")

    # Cross-validation to get an estimate of model performance
    cv_scores = cross_val_score(best_model, one_hot_encoded_data, y, cv=5, scoring='neg_mean_squared_error')
    print(f"Cross-validated Mean Squared Error: {(-cv_scores.mean())**0.5}")

    print(f"Best Parameters: {grid_search.best_params_}")
    print("="*70+"\n")

predicted_values = np.array(predicted_values).T
labels_values = np.array(labels_values).T

predicted_values = pd.DataFrame(predicted_values, columns=tests_to_predict)
labels_values = pd.DataFrame(labels_values, columns=tests_to_predict)
======================================================================
Subject: math_score
======================================================================
Test Root Mean Squared Error: 	6.199838967296009
Train Root Mean Squared Error: 	5.803453801000383
Test R2 Score: 	0.8089371591277158
Train R2 Score: 	0.8551245078947762

Top 10 Features Based on Importance:
                                           Feature  Importance
3                                    reading_score    0.827071
0                                           gender    0.138801
1                                            lunch    0.019183
8                           race_ethnicity_group E    0.003679
5                           race_ethnicity_group B    0.001853
2                          test_preparation_course    0.001581
6                           race_ethnicity_group C    0.001462
13        parental_level_of_education_some college    0.001434
11         parental_level_of_education_high school    0.001142
9   parental_level_of_education_associate's degree    0.001125


Cross-validated Mean Squared Error: 6.496883050504474
Best Parameters: {'max_depth': 5, 'n_estimators': 800}
======================================================================


======================================================================
Subject: writing_score
======================================================================
Test Root Mean Squared Error: 	4.378734131809081
Train Root Mean Squared Error: 	3.829437331572926
Test R2 Score: 	0.9156152853991701
Train R2 Score: 	0.9360840504808379

Top 10 Features Based on Importance:
                                         Feature  Importance
3                                  reading_score    0.982055
2                        test_preparation_course    0.005955
0                                         gender    0.003863
7                         race_ethnicity_group D    0.001621
14  parental_level_of_education_some high school    0.001380
11       parental_level_of_education_high school    0.001055
5                         race_ethnicity_group B    0.000936
6                         race_ethnicity_group C    0.000574
1                                          lunch    0.000523
12   parental_level_of_education_master's degree    0.000488


Cross-validated Mean Squared Error: 4.334388992093969
Best Parameters: {'max_depth': 5, 'n_estimators': 600}
======================================================================

Visualise Predictions¶

  • Visualize Predictions: Plot the predicted scores against actual scores for each subject to evaluate the model’s prediction accuracy.

  • Ideal Line for Comparison: Include a reference ideal line (y = x) to see how close the predicted values are to the actual values.
In [37]:
x_ideal= np.arange(0,100)
y_ideal=x_ideal
for subject in tests_to_predict:
    # print(subject)
    try:
        title_string =  f"{subject.replace('_',' ').title()} Predicted vs Actual Scores"
    except:
        title_string = f"{subject.title()} Predicted vs Actual Scores"
    plt.figure(figsize=(10,6))
    plt.title(f"{title_string}")
    plt.plot(x_ideal,y_ideal,color="red",label="Ideal Line")
    plt.scatter(labels_values[subject],predicted_values[subject],color="blue",label="Math Score") 
    plt.xlabel("Actual Scores")
    plt.ylabel("Predicted Scores")
plt.show()
No description has been provided for this image
No description has been provided for this image

Conclusions / Observations¶

  • Challenging to Predict Scores with Current Features: Predicting math, reading, and writing scores using only demographic features (lunch, test prep, ethnicity, parent education, and gender) is challenging and lacks accuracy.

  • One Known Score Improves Prediction of Others: Knowing one test score (e.g., math) significantly enhances the accuracy of predicting the other two scores (reading and writing).

  • Best Predictor for Performance Across Tests: A single known test score is the best indicator of a student’s performance in the other subjects. The code can be adjusted to experiment with different known test scores as inputs, setting a prior score for one subject or leaving it blank to simulate no prior knowledge.

  • Key Features for Predicting Math Scores: For math scores, the most important predictors are a prior test score and gender.

  • High Importance of Reading/Writing Scores for Each Other: When predicting reading scores, knowing the writing score is highly predictive (about 90% importance), and vice versa.